
feat(ci): add HPC-grade CI/CD workflows, GPU hardware validation, and benchmark regression gate#6

Merged
Shards-inc merged 1 commit into main from codex/design-ci/cd-pipeline-for-gpu-kernels on Mar 8, 2026

Conversation

@Shards-inc (Member)

Motivation

  • Provide a production-grade CI/CD design for kernel-style compute projects that enforces deterministic builds, hardware-specific validation, and automated performance gates.
  • Move expensive hardware checks (GPU numerical correctness and benchmark regression) into separate, scheduled workflows so PR latency stays acceptable.

Description

  • Added a new architecture reference docs/pipelines/HPC_CICD_ARCHITECTURE.md describing the layered pipeline, deterministic build strategy, validation gates, rollback policy, and scaling guidance.
  • Introduced multiple GitHub Actions workflows: .github/workflows/hpc-matrix.yml (expanded CPU/Torch & CUDA compatibility matrix), .github/workflows/gpu-hardware.yml (self-hosted GPU hardware validation + benchmark artifact upload), .github/workflows/benchmark.yml (benchmark regression enforcement), and .github/workflows/docs.yml (strict docs build).
  • Added a benchmark regression utility scripts/ci/check_benchmark_regression.py and a placeholder baseline .ci/benchmarks/baseline.json, plus mkdocs.yml and links from README.md / docs/README.md to surface the new doc.
  • Updated CHANGELOG.md to record the CI/CD additions and committed the new files to the PR branch.
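The regression utility listed above is invoked as `python scripts/ci/check_benchmark_regression.py .ci/benchmarks/baseline.json 0.05` and passes when the baseline is empty. A minimal sketch of such a gate is below; the real script's file formats, field names, and `benchmark.json` path are assumptions, not the committed implementation.

```python
"""Hedged sketch of a benchmark regression gate in the spirit of
scripts/ci/check_benchmark_regression.py. Data shapes are assumed:
baseline and current results are {benchmark_name: seconds} mappings."""
import json
import sys


def check_regression(baseline: dict, current: dict, tolerance: float) -> list:
    """Return failure messages for benchmarks slower than baseline * (1 + tolerance)."""
    failures = []
    for name, base_time in baseline.items():
        cur_time = current.get(name)
        if cur_time is None:
            continue  # benchmark removed or renamed; not treated as a regression
        if cur_time > base_time * (1.0 + tolerance):
            failures.append(f"{name}: {cur_time:.6f}s vs baseline {base_time:.6f}s")
    return failures


def main() -> int:
    baseline_path, tolerance = sys.argv[1], float(sys.argv[2])
    with open(baseline_path) as f:
        baseline = json.load(f)
    if not baseline:
        print("Baseline empty; regression gate skipped.")
        return 0  # matches the observed behavior with the placeholder baseline
    with open("benchmark.json") as f:  # assumed output path of the benchmark run
        current = json.load(f)
    failures = check_regression(baseline, current, tolerance)
    for msg in failures:
        print("REGRESSION:", msg)
    return 1 if failures else 0


if __name__ == "__main__" and len(sys.argv) >= 3:  # CLI use only
    sys.exit(main())
```

Exiting nonzero on any regression lets the workflow fail the job directly, while the empty-baseline early return keeps the gate inert until a real baseline is committed.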

Testing

  • Ran bash scripts/smoke.sh (project smoke tests) which completed successfully. ✅
  • Byte-compiled the benchmark checker with python -m compileall scripts/ci/check_benchmark_regression.py and ran python scripts/ci/check_benchmark_regression.py .ci/benchmarks/baseline.json 0.05; both succeeded (the baseline is empty, so the gate is skipped). ✅
  • Attempted docs build validation; mkdocs installation failed in this environment due to package index/proxy restrictions, so mkdocs build --strict could not be validated here. ⚠️
  • Ran ruff check . which failed repository-wide due to pre-existing lint issues unrelated to the changes in this PR; these lint failures are external to the new CI artifacts and should be addressed separately. ❌
  • Note: GPU hardware workflow is designed to run on self-hosted GPU runners and was not executed in this environment (requires labeled runners: a100, h100, rtx4090).

Codex Task

@Shards-inc Shards-inc merged commit 26dac06 into main Mar 8, 2026
7 of 25 checks passed

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27c24d6f85

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

```yaml
    pip install pytest pytest-benchmark

- name: Run benchmark suite
  run: pytest tests -m "benchmark" --benchmark-json benchmark.json -q
```


P1: Run benchmark suite without an empty marker filter

pytest -m only executes tests matching the mark expression (pytest --help), and this repository currently has no benchmark-marked tests under tests/, so this command deselects everything and exits with code 5 (No tests collected). In this workflow context (PR and nightly), that causes the benchmark job to fail before the regression checker can run.


```yaml
run: pytest -m "not gpu" --cov=kernels --cov=implementations --cov-fail-under=85

- name: Integration tests
  run: pytest tests -m "integration and not gpu" -q
```


P1: Remove HPC marker filters that select zero tests

The integration step uses -m "integration and not gpu", but there are no matching marks in the current tests/ tree, so pytest deselects all tests and returns exit code 5; the same issue also affects the fallback step in this workflow. As written, both matrix jobs can fail even when the codebase is otherwise healthy.

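Whichever way the marker filters are resolved, the marker names the workflows reference should also be registered so pytest (especially under `--strict-markers`) does not warn or error on them. A minimal pytest.ini sketch registering those names is below; the file itself and the descriptions are assumptions (pytest-benchmark registers its own `benchmark` marker when installed):

```ini
# Hedged sketch: register the markers the CI workflows filter on.
[pytest]
markers =
    benchmark: performance benchmarks run by benchmark.yml
    integration: cross-component tests run in the HPC matrix
    gpu: tests requiring a CUDA-capable device
    numerical: numerical-correctness checks for GPU kernels
```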


```yaml
- name: GPU integration and numerical tests
  run: |
    pytest tests -m "gpu or numerical" -q
```


P1: Fix GPU test selection to avoid no-tests-collected failures

This GPU validation step filters on gpu/numerical markers, but there are no such marks in the present test suite, so pytest exits 5 after deselecting everything. On scheduled/manual self-hosted runs, the workflow will fail at test selection rather than validating hardware behavior.

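As with the benchmark marker, this suggestion can be addressed by adding gpu/numerical-marked tests so the self-hosted workflow collects something. A hedged sketch, assuming a torch-based project; the matmul check is a stand-in, not a test from this repository:

```python
# Hedged sketch: a test carrying the gpu and numerical markers so that
# `pytest tests -m "gpu or numerical"` collects at least one test on the
# self-hosted GPU runners.
import pytest


@pytest.mark.gpu
@pytest.mark.numerical
def test_gpu_matmul_matches_cpu():
    torch = pytest.importorskip("torch")  # skip cleanly if torch is absent
    if not torch.cuda.is_available():
        pytest.skip("requires a CUDA-capable device")
    a = torch.randn(64, 64)
    b = torch.randn(64, 64)
    expected = a @ b  # CPU reference result
    actual = (a.cuda() @ b.cuda()).cpu()
    torch.testing.assert_close(actual, expected, rtol=1e-4, atol=1e-4)
```

Gating on `torch.cuda.is_available()` inside the test (rather than at import time) keeps collection working on CPU-only runners while still validating numerics on the a100/h100/rtx4090 labels.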
